An Empirical Evaluation of MapReduce under Interruptions
Authors
Abstract
Interruptions are an unwanted but inevitable fact that all large-scale distributed computing systems have to face. They are more prevalent for MapReduce applications, because MapReduce often runs on top of clusters built from commodity hardware, which are more vulnerable than traditional HEC systems. The problem is further exacerbated when MapReduce applications run in a distributed non-dedicated computing environment, where the host applications have the privilege to take back the computing power at random and interrupt MapReduce applications. This study evaluates the resilience of MapReduce applications empirically. In particular, we set up a MapReduce system, inject interruptions with different patterns, and study their impact on the performance of the TeraSort benchmark. We simulate both cluster and distributed non-dedicated computing environments to observe the impact of these interruptions. Both data locality and benchmark execution time are measured. We also vary the number of replicas to observe its impact on application performance. The experimental results show that interruptions have a significant impact on the performance of MapReduce applications. MapReduce in a distributed computing environment is more vulnerable to interruptions because of the high data migration cost. Finally, we show that extra data replicas help to mitigate the impact of interruptions.
Similar resources
A New Data Mining Algorithm based on MapReduce and Hadoop
The goal of data mining is to discover hidden useful information in large databases. Mining frequent patterns from transaction databases is an important problem in data mining. As the database size increases, the computation time and required memory also increase. Based on this, we use the MapReduce programming model, which has parallel processing ability, to analyze the large-scale network. All t...
MapReduce vs. Pipelining Counting Triangles
In this paper we follow an alternative approach, named pipeline, to build a parallel implementation for the well-known problem of counting triangles in a graph. This problem is especially interesting when the input graph either does not fit in memory or is dynamically generated. To be concrete, we implement a dynamic pipeline of processes and an ad-hoc MapReduce version using the language Go....
A Methodology for Understanding MapReduce Performance Under Diverse Workloads
MapReduce is a popular, but still insufficiently understood paradigm for large-scale, distributed, data-intensive computation. The variety of MapReduce applications and deployment environments makes it difficult to model MapReduce performance and generalize design improvements. In this paper, we present a methodology to understand performance tradeoffs for MapReduce workloads. Using production ...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. Current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy for MapReduce in a homogeneous Hadoop cluster assume that each node in the cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
Adaptive Information Passing for Early State Pruning in MapReduce Data Processing Workflows
MapReduce data processing workflows often consist of multiple cycles, where each cycle hosts the execution of some data processing operators (e.g., join) defined in a program. A common situation is that many data items propagated along a workflow end up being “fruitless”, i.e., they do not contribute to the final output. Given that the dominant costs associated with MapReduce processin...